In Class Exercise - Sept 19: Managing Tabular Data to Visualize College Health

Authors

Jamaal Green, PhD

Ryan Smith

Published

September 19, 2024

Potential skills used – R (dplyr), manipulating and working with tabular data

Background

Prof. O’Neill is a principal researcher at U3 Advisors, and one of the major projects he’s led over the past few years explores the fiscal health of America’s colleges and universities. While the pandemic seeded the latest round of enrollment issues at many universities, the problem of falling enrollment and rising expenses is decidedly not new. As part of that project, Prof. O’Neill and his team tracked over 2,300 colleges and universities. In examining these schools, they uncovered some interesting, and troubling, geographic patterns of fiscal risk. To better track these patterns of potential closures, they set up a Tableau dashboard to visualize risk.

Goals for Today

Today, we will be introduced to an essential part of our GIS practice: manipulating tables. As I said in our first class, spatial isn’t special (until it is). All of our information is stored in tables (even our spatial data). To be successful GIS practitioners, we must be experts at manipulating and working with tabular data.

We will be making extensive use of the tidyverse set of packages today, specifically dplyr. While it is possible to do the following table operations in ArcGIS Pro, it is fundamentally more awkward, takes longer, and is less reproducible. We will include the steps one can take in ArcGIS Pro, and if you want to go that route you are welcome to, but I would recommend that you really focus on the dplyr approach and get familiar with it as it will help to prime you for further success.

Let’s Get Started

If you have not yet, set up a folder for this lab in an appropriate area and make a new R project for this assignment if you need it. If you have organized your folders slightly differently so you do not need a new R project, that’s fine also. I just want to make sure you know where your data and work are and that you can manage them easily.

Importing Our Tables

if(!require(pacman)){install.packages("pacman"); library(pacman)}
p_load(tidyverse, sf)

schools <- st_read(here::here("labs_exercises/2024-09-19_week4_college_health/2024-09-19_CollegeHealth_In_Class/Schools.dbf"), quiet = TRUE)

Okay, already you should see something new. The opening lines here are something I picked up from one of my mentors, Prof. Liming Wang, at Portland State. pacman is a package that makes it easier to manage other R packages.

  • Let’s break this down. The preamble basically says: if the require() call fails (i.e., pacman is not installed), then install pacman with install.packages() and load it with library()
    • require() is basically equivalent to library() but is used inside functions

We will cover more on conditionals later in the term. The p_load() function will check to see if you’ve installed the listed packages, and if not, it will install them for you and also load them. So it combines the install.packages() and library() functions into one!
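The same check-then-load pattern can be written for any package. Here is a minimal sketch using "stats", which ships with base R, so the install branch never actually runs in this example:

```r
# If the package is not already installed, install it, then load it.
# "stats" comes with base R, so install.packages() is never triggered here.
if (!require("stats", character.only = TRUE, quietly = TRUE)) {
  install.packages("stats")
  library("stats", character.only = TRUE)
}
```

p_load() wraps exactly this logic for every package you list, which is why we only need the pattern once, for pacman itself.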

Next, we will use the st_read() function from sf to read the .dbf file for schools.

  • .dbf files are the files that hold the tabular data for our shapefiles. Because it is a table we can read it directly and st_read() gives us that functionality.

Now you should also see something new and that is using the here() function from the here package. here is a package to make it easier to manage your file paths.

  • The package will find your root folder (in this case, it’s the folder that holds our R Project file) and will make sure your paths are all relative to that root folder.

  • This is especially useful if you need files from folders that are within your project directory, but not within the folder that contains the script you are currently running.

  • Again, don’t worry about this too much, but if you want more info the vignette for here can be found at this link.
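To see the difference, compare here() with base R’s file.path(), which builds a path relative to the current working directory (the fragile behavior here() is designed to avoid). The folder names below are purely illustrative, not the actual lab paths:

```r
# here::here("data", "Schools.dbf") builds a path from the PROJECT ROOT.
# The base-R analogue, relative to the working directory, is:
p <- file.path("labs_exercises", "2024-09-19_week4_college_health", "Schools.dbf")
p  # "labs_exercises/2024-09-19_week4_college_health/Schools.dbf"
```

The path itself looks the same; the difference is the starting point it is resolved against, which is what breaks scripts when they are run from a different folder.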

So, our schools dataframe should have three columns: a school ID (UnitIDs), longitude, and latitude. While interesting, this does not give us that much useful information. So, let’s bring in some additional data for our schools.

school_data <- st_read(here::here("labs_exercises/2024-09-19_week4_college_health/2024-09-19_CollegeHealth_In_Class/Schools_Data.dbf"), quiet = TRUE)

What do we see? We should notice our school_data dataframe has the same number of rows, but now we have ten columns. So, how can we associate the school’s data with our school IDs and point coordinates? A table join!
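A quick way to answer “how many rows and columns?” for any dataframe is nrow(), ncol(), and dim(). A tiny base-R illustration with an invented toy table (these are not the real school fields):

```r
# Toy dataframe with invented values, just to show the shape-checking functions
toy <- data.frame(UnitID = c(1, 2, 3), name = c("A", "B", "C"))
nrow(toy)  # 3 rows
ncol(toy)  # 2 columns
dim(toy)   # c(3, 2): rows, then columns
```

Running these on schools and school_data is the fastest way to confirm they line up before joining.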

Making Our Joins

Table joins are absolutely essential operations. Get comfortable with them, because you will use them all the time from here on out. We can join our two tables based on the IDs of both tables.

  • Now, while dplyr is usually smart enough to join columns of different data types, in this case, we will be extra careful and convert the ID column in our school_data dataframe to a character column, and then join the tables.

  • We will make use of the mutate() function to modify our ID column in place and then use the inner_join() function to join the tables.

school_data <- school_data %>% 
  mutate(UnitID = as.character(UnitID))

schools <- schools %>% 
  inner_join(school_data, by = c("UnitIDs" = "UnitID"))

Okay, let’s walk through what we did here. We modified our UnitID column in our school_data dataframe using the mutate() function from dplyr and converted it from a numeric to a character using the as.character() function. Then, we modified our schools dataframe by joining it to the school_data dataframe using the inner_join() function.

  • We used the by = argument to tell dplyr which columns from the two tables we want to join on. If the two columns had shared the same name, inner_join() would have assumed we wanted to join on them and done so automatically.

  • While this can be a helpful default, it also means we must be very careful when thinking about joining our tables if they have shared column names.
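A cheap sanity check after any join is to count rows before and after. Base R’s merge() performs the same inner-join logic as inner_join(), and a toy example (invented values, deliberately mismatched key names like ours) makes the row-count behavior easy to see:

```r
# Toy tables: key column names differ, just like UnitIDs vs. UnitID
left  <- data.frame(UnitIDs = c("1", "2", "3"), x = c(10, 20, 30))
right <- data.frame(UnitID  = c("1", "3"),     y = c("a", "b"))
# An inner join keeps only rows whose keys appear in BOTH tables
joined <- merge(left, right, by.x = "UnitIDs", by.y = "UnitID")
nrow(joined)  # 2: the row with UnitIDs == "2" is dropped
```

If an inner join returns fewer rows than you expected, some keys failed to match, and type mismatches like the one we just fixed are a very common culprit.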

Creating New Variables of Interest

Next, we will calculate two new variables: the share of low-income undergraduate enrollment at these institutions, and how dependent our institutions are on in-state residents.

Low-Income Undergrad Share

Our first metric, the proportion of enrolled undergraduates with low incomes, is a straightforward calculation. We will again make use of our mutate() function:

schools <- schools %>% 
  mutate(low_inc_per = LowIncUG/TotalUG)

I hope this looks straightforward. We take our schools dataframe and pipe it into our mutate() function where we create a new variable, low_inc_per, which is the quotient of low-income undergrads and total undergraduate enrollment. This variable appears in the last column of our schools dataframe.

State Dependency

This variable involves categorizing the StatePop variable into two categories: high or low state dependency. To do this, we will use if_else(), a conditional statement.

  • Conditional statements are used when we want our machines to perform some operation based on some condition. The basic logic of an if_else() statement is: if [some condition is TRUE], then [do something]; else, [do some other thing].

In this case, we will categorize universities with greater than or equal to 65% in-state enrollment as “High State Enrollment” and all other schools as “Low State Enrollment”.

schools <- schools %>% 
  mutate(state_dependent = if_else(StatePop >= .65, "High State Enrollment", "Low State Enrollment"))

Okay, let’s break this down. We take our schools dataframe and pipe it into mutate(). We create a new variable called “state_dependent” using if_else(): if the “StatePop” value is greater than or equal to .65, label the school High State Enrollment; otherwise, label it Low State Enrollment. Conditional statements like if_else() are found everywhere and are incredibly useful. They can quickly become unwieldy, but we will have solutions for that later in the term. For now, just see if you understand the logic behind what we are doing.
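The same logic exists in base R as ifelse() (vectorized like dplyr’s if_else(), though looser about types). A toy sketch with invented shares, not the real StatePop column:

```r
# Toy in-state enrollment shares (invented values)
state_pop <- c(0.80, 0.40, 0.65)
dep <- ifelse(state_pop >= 0.65,
              "High State Enrollment",
              "Low State Enrollment")
dep  # "High State Enrollment" "Low State Enrollment" "High State Enrollment"
```

Note that 0.65 itself lands in the “High” bucket because we used >=, not >; boundary cases like this are worth thinking about whenever you categorize a continuous variable.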

Creating Summaries

We have created new variables, both continuous and categorical, and now we will create some summary values from our schools dataframe. We will make use of the group_by() and summarise() functions from dplyr. We are looking to get the median for the share of low-income undergraduate students and the median earnings of students after graduation by the university risk category.

schl_lowinc_earn <- schools %>% 
  group_by(RiskCatego) %>% 
  summarise(low_inc_median = median(low_inc_per), 
            med_earnings = median(MedEarn))

Alright, let’s break this down. We take our schools data frame, pipe it into our group_by(), where we group on the “RiskCatego” variable, and then we pipe that into the summarise() function, creating two new variables: low_inc_median and med_earnings. Both of these variables take the medians of the original variables using the median() function, which comes built into base-R. Take a look at this new table: Do we see any potential connections among these variables by risk category?
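Under the hood, group_by() + summarise() is the classic split-apply-combine pattern. Base R’s aggregate() does the same job, shown here with an invented toy table (not the real risk categories’ values):

```r
# Toy data: two schools in each of two risk categories (invented values)
toy <- data.frame(
  RiskCatego  = c("Stable", "Stable", "Serious", "Serious"),
  low_inc_per = c(0.20, 0.30, 0.50, 0.70)
)
# Median of low_inc_per within each risk category
res <- aggregate(low_inc_per ~ RiskCatego, data = toy, FUN = median)
res
```

The dplyr version reads more naturally in a pipeline and scales to several summary columns at once, which is why we prefer it, but the underlying operation is identical.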

Combining Spatial and Table Operations

Okay, thus far we have gone through an example of a table join using our inner_join() function. We have created new variables using our mutate() functions, and figured out how to calculate group summary statistics using group_by() and summarise(). We have also explored the use of conditionals with the if_else() function. We have done a lot thus far, and now we will take it to the next level by combining spatial operations using the sf package with our dplyr functions to explore the spatial variation of our variables.

Making Our Data “Spatial”

Remember, spatial isn’t special (until it is). Functionally, “spatial” is “only” a new column in our dataframes (it’s more complicated than this, but what we see is just a new column). In this section, we will figure out how to associate our schools with basic information on the states and metro areas they are in. To do this, we will introduce a series of new functions from the sf package.

First, let’s turn our schools dataframe into a simple features dataframe by using the st_as_sf() function.

schools_sf <- schools %>% 
  st_as_sf(coords = c("Long", "Lat"), crs = 4326)

Let’s break this down because there’s a lot here that is new:

  • We create a new dataframe called schools_sf (schools simple features) and we pipe our original schools dataframe into st_as_sf(). st_as_sf() turns a standard dataframe into a simple features dataframe, provided the dataframe contains the appropriate geographic information.

  • Because our schools are “points” data we use the coords = argument to tell sf we have point data and that the columns that correspond to our coordinates are “Long” (longitude/X) and “Lat” (latitude/Y).

  • Next, we have the crs = argument. “crs” stands for “coordinate reference system” and we use the EPSG code “4326” for WGS84, a standard geographic coordinate system often used for global mapping as well as web mapping.

    • There are mountains of writings on projections, coordinate systems, and the like, but for now, just know that we have lat/longs and they are projected in WGS84.

Schools by Risk Category

After converting, we can make a quick map to check that our coordinates and projection are correct, and it looks good (don’t worry about how I mapped this just yet; it’s purely illustrative).

Okay, so now our schools_sf dataframe is correctly spatial. Now we have to figure out a way to join it to our state and MSA data.

Spatial Joins and Summaries

Okay, let’s check where we are now. We have our schools_sf dataframe that is now a proper simple features dataframe projected in WGS84. But, ultimately, we want to get summary information by state and MSA. Our next steps will involve bringing in our state and MSA data, joining it to our schools_sf dataframe, and then calculating some summary values. Let’s get to it.

First, let’s bring in the MSA and States shapefiles using the st_read() function from sf. Note that when you load a spatial file using st_read(), sf gives you information on the CRS of your spatial file.

states <- st_read(here::here("labs_exercises/2024-09-19_week4_college_health/2024-09-19_CollegeHealth_In_Class/States.shp"), quiet = TRUE)

msa <- st_read(here::here("labs_exercises/2024-09-19_week4_college_health/2024-09-19_CollegeHealth_In_Class/SelectMSA.shp"), quiet = TRUE)

So, now that you’ve loaded these new spatial files, do you notice anything different about these layers? Note that the CRS of the state and MSA files is different from that of our schools data. What can we do? Well, we can reproject layers to different projections using the st_transform() function. Let’s do that now.

schools_sf <- schools_sf %>% 
  st_transform(crs = st_crs(states))

Let’s break this down:

  • we take our schools_sf dataframe and pipe it into the st_transform() function. st_transform() is designed to change the projections of simple features dataframes.

  • The crs = argument sets the coordinate reference system we want to project our simple features dataframe to. In this case, we use the st_crs() function within the crs = argument to pull the coordinate reference system information from our states simple features dataframe.

  • This is a straightforward way to change the projection of one of our layers to the projection of another.

Just to make sure let’s see if we can map these two layers (again, don’t worry about how to map it in R just yet, we’ll get to it later in the course).

Colleges On Our States

So, now we have our layers sharing their appropriate projections…but how can we associate our state information with our colleges? The two dataframes do not have a shared column so we cannot do a table join. This is where the spatial join comes in. The logic of the spatial join is straightforward: we associate the attributes of one object to another based on their spatial relationship to each other. In this case, we will assign state information to the colleges/universities that are within them. This kind of operation is often called a point-in-polygon operation. Let’s take a look.

schools_state_sf <- schools_sf %>% 
  st_join(states, join = st_within)

schools_state_msa_sf <- schools_state_sf %>% 
  st_join(msa, join = st_within)

Okay, let’s break this down. First, while I create new dataframes in this lab, you do not need to do so in your everyday work. I’m doing this to make it clear what we are doing with each step, but, generally, we don’t need to clutter our environment. Take a look at these two new simple features dataframes and think about what’s different between them. How many rows do we have? How many columns?

Summary Calculations With Our Geographic Data

Let’s take stock of what we’ve accomplished. We have learned how to calculate group summary statistics, worked on creating and categorizing new variables, and we have learned how to perform spatial joins. That’s a good bit thus far. So, what’s next?

Now, we will calculate some summary statistics by state using our spatially joined dataframes. In doing so, we will make use of a new function: filter().

First, what are we calculating? I want you all to calculate the median of the share of low-income undergraduates and earnings after graduation by state and by university risk category for the states of Illinois, Michigan, and Texas.

There are multiple ways to attack this, but let’s start by pulling out our states of interest. As a reminder, we’ll be making use of the filter() function from dplyr. filter() is a function designed to subset our dataframes by row based on some condition.

tx_mi_il <- schools_state_sf %>% 
  filter(STATE %in% c("Texas", "Michigan", "Illinois"))

First, let’s check if this call passes a preliminary test. How many rows do we have? How many columns? Logically, how should the shape of our dataframe change?

Now let’s break this call down even more because you should notice something new here:

  • The %in% operator is used to search within a vector. So STATE %in% c("Texas", "Michigan", "Illinois") inside filter() tells dplyr to keep only the rows whose STATE value matches one of the values in our vector.

  • This is a very, very handy little tool when working with categorical/text data if you know exactly what you want to pull out.
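Since %in% is plain base R, it is easy to experiment with on its own. A toy vector of state names (invented list, not our actual data):

```r
# %in% returns a logical vector: TRUE where a value appears in the target set
state_vec <- c("Texas", "Ohio", "Michigan", "Illinois", "Maine")
state_vec %in% c("Texas", "Michigan", "Illinois")
# TRUE FALSE TRUE TRUE FALSE
# Subsetting with that logical vector keeps only the matches
keep <- state_vec[state_vec %in% c("Texas", "Michigan", "Illinois")]
keep  # "Texas" "Michigan" "Illinois"
```

filter() is doing exactly this row-wise: it keeps the rows where the logical vector is TRUE.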

Okay, now let’s do a quick visual check. Yup, the colleges/universities in our layer are all in MI, TX, or IL. You can check this yourself if you export your dataframe to a shapefile or geopackage and map it in ArcGIS Pro.

IL, MI and TX Schools

So, we have our schools; let’s calculate our summary statistics. We want the median enrollment share of low-income undergraduates and income after graduation by risk category and by state. We will make use of group_by() again here. group_by() really shows its strengths here because we can group on multiple variables at the same time! Let’s take a look.

tmi_summary <- tx_mi_il %>% 
  as_tibble() %>% 
  group_by(STATE, RiskCatego) %>% 
  summarise(low_inc = median(low_inc_per), 
            med_earnings = median(MedEarn))

Okay, you should notice something else new here. Before grouping and summarising, we pipe our tx_mi_il dataframe into the as_tibble() function.

  • Why is that? Well, remember our tx_mi_il dataframe is a simple features dataframe, so it has geometry associated with it. The creators of sf made it such that if we do some kind of grouped operation, not only would we get summarized versions of our tabular attribute data, but it would also try to group our spatial data.

  • This can lead not only to operations taking much longer than they should (manipulating complex geometries is very computationally expensive), but you can get weird nonsense results dumped into a geometry column we don’t actually care about because we are not looking to map anything in this particular table.

  • A quick way around this behavior is to convert our simple features dataframe back to a regular dataframe and as_tibble() does that for us.

So, now we have our summary dataframe. How many rows does it have? How many columns? Does that make sense to you? Why or why not? Do we see different risk profiles by geography?

Let’s take a look at a graph (don’t worry too much about how to make this graph yet, we’ll get deeper into plotting later in the term).


Share of Low-Income Undergrads by State

So, we’ve done a lot thus far. We have worked through table joins and spatial joins; we have worked on grouped summaries both with single and multiple groups; we have explored a bunch of new functions such as %in%…in short, we’ve covered a LOT. Pat yourselves on the back and let’s move to part 2 of the lab.

Doing It Live…

For the second part of the lab, I want you to use the skills we’ve covered in this lab to join the “MSA_Data_Join” and “State_Data_Join” tables to their respective spatial files, export them, and map the “ColorRange” variable. “ColorRange” represents the proportion of schools in that state or MSA at risk of closure.

Completing The Exercise in ArcGIS Pro

The following are instructions to complete the steps outlined above, but instead of manipulating tables in R, you’ll be completing the steps in ArcGIS Pro. You will be cleaning and joining tabular data to spatial data to visualize the financial health of certain colleges and the resulting impacts. You’ll analyze both colleges themselves (Part 1) and then geographies (Part 2).

Part 1

  1. Drag in States and SelectMSA. You can turn these two shapefiles off for now; they are intended to “prime the dataset” (i.e., align us to our desired coordinate system). Eventually, in Part 2, we will join some summary data at this level as well.

  2. Drag in Schools.dbf. This is a sample of ~1,200 four-year public and not-for-profit colleges and universities that offer undergraduate degrees. As of now, it tells us nothing more than a formal ID and where they are located (Lat/Long). Use the Longitude (X) and Latitude (Y) fields to map. Remember to pick a coordinate system that has the appropriate units that match up with the units in Lat/Long (Degrees).

  3. Next, we will want to visualize and summarize some data about these schools, but first we will need to join the data from Schools_Data to Schools (see the end of this document for an index of what each field means).

  4. The Join cannot happen automatically. Notice that the Unit ID, which seems to be a shared field in both, is stored as Text in Schools and as Integer in Schools_Data. You will have to fix one of these before you can join. To do so, create a new field that is either Text or Integer, and use Calculate Field to “take the value” from the appropriate Unit ID field.

  5. Once you have two fields of identical type in each dataset, right-click the Schools layer that you plotted on the map and select Join. Join the Schools_Data. Confirm that the join worked (i.e., open the attribute table and scroll to see if there are any NULLs).

  6. We’re going to want to edit the table, but it is much easier to do that when the join is not “live.” So right-click on the MappedSchools and select Export Features. Make sure you save this as a shapefile somewhere on your computer where you can easily access it.

  7. Now that we have the data joined, we’ll want to do a simple visualization. Make sure that MappedSchools is selected. Go to the appearance tab at the top of the ribbon and select Symbology. From the dropdown menu select Unique Values.

  8. Under the field option, select “Risk Category” (this field is a rough summary of the level of risk that a given school has of closing in the next few years). Add all four values and create a symbology, perhaps: Serious = Red; Some = Orange; Stable with Risk = Light Blue; and Stable = Blue. Zoom around the map and see what you’ve created. **Advanced Technique 1**

  9. Next we will want to Add and Calculate some fields for later. I want you to create two fields: % Low-Income Undergraduates (but call it PLowIncUG) – this should be a DOUBLE or FLOAT; Dependency on State Residents (but call it StateDep) – this should be a TEXT.

  10. Next, we will right-click each field and select Field Calculator.

  11. We will start with the first field we created: [PLowIncUG]. Because we are looking to calculate the percent of low-income undergraduate students, we want to divide [LowIncUG] by [TotalUG]. Be sure to click each of the options, as opposed to directly typing them in, to prevent any syntax errors.

  12. For [StateDep], we will use the Field Calculator to quickly summarize the percent of in-state residents enrolling in a school’s undergraduate body. Begin by sorting the table by the [StatePop] field in either ascending or descending order. Next, in the table, manually select every row that is 65% or over: click on the first row, scroll down to the last row that meets the criteria, hold Shift, and click that row as well.

  13. Next, you will open the Field Calculator and type in the phrase “High State Enrollment” (include the quotation marks). Then, in the attribute table, you will see an icon at the top that says Switch; select this to invert the current selection.

  14. Now, in the Field Calculator window, you’ll type in “Low State Enrollment.” At this point, you have roughly categorized the schools by whether in-state enrollment constitutes more or less than 65% of the undergraduate student body.

  15. Given our data, we will need to do a series of Summaries and Selections to finalize our analysis.

  16. First, we will summarize different variables by risk level. Right-click on [RiskCatego] and select Summarize. Make sure the Case field is [RiskCatego]. Then, under Statistics, select two: [PLowIncUG] – Median and [MedEarn] – Median (this is the median earnings of students from this school after ten years).

  17. Save the table and open the results. Make a note of how these numbers vary by schools of different risk tiers. Are there more low-income students at higher-risk schools? What is the difference in long-term earnings from schools with greater financial risk indicators than those that are stable?

  18. We are also interested in seeing the role GEOGRAPHY plays in all of this. Take a look at Texas and Michigan (or Texas and Illinois). Take the Select tool at the top of the screen and “roughly” try to select as many schools in the state of Texas as you can (it does not need to be perfect) and hit Statistics for [PLowIncUG] and [RiskScore]. Record these values. Do the same thing for Michigan or Illinois, and record those values as well. How perceptible is the geographic difference here? **Advanced Technique 2**

Part 2

Now, we’ll want to see how states and MSAs would score overall, given the number of schools within those geographies at risk of closure in the next few years.

  • For this part, I won’t walk you through the steps. I simply want you to take the information that exists in the tables MSA_Data_Join and State_Data_Join and join them to the respective MSA and State data files. Once they are joined, use the [ColorRange] field to visualize the data. This field represents the % of schools within that state or MSA at serious risk of closure. **Advanced Technique 3**

Advanced Technique 1

To prepare yourself for next week’s lessons on symbology, see if you can map some of the more quantitative elements of the schools. Can you color or size them by their enrollment? By their specific risk score? Can you create a pie chart that looks at low-income vs. non-low-income students?

Advanced Technique 2

What if there was a way to select all schools in Texas, Illinois, or Michigan without using your mouse to do it manually? Could you explore the Select by Location Tool and see if there was a way?

Advanced Technique 3

Pretend the [MSA] field did not exist in the MSA_Data_Join table. How could you combine the two fields at the end [MSAName] and [StateName] into a single field that would join with the MSA shapefile?

To Turn In

In a single document (the format is up to you: Word doc, PDF, PowerPoint, etc.), I want you to produce 5 maps (made either in ArcGIS Pro or R if you feel ambitious).

Maps to Turn In

  • Create a national map of schools by their risk category

  • Create a national map of schools by the proportion of low-income undergraduate students (play with the symbology here to see what works as an effective visualization for you)

  • Create a map of a state of your choice of the schools by the state dependence categorical variable you calculated

  • Create a national map of states with the proportion of schools at risk of closure

  • Create a national map of MSAs with the proportion of schools at risk of closure

For each map, give a few sentences on what you see and why you made your design choices.